Abstract

The literature on statistical learning for time
series assumes the asymptotic independence
or "mixing" of the data-generating process.
These mixing assumptions are never tested,
nor are there methods for estimating mixing
rates from data. We give an estimator for
the $\beta$-mixing rate based on a single stationary
sample path and show it is $L_1$-risk consistent.



1 Introduction 

Relaxing the assumption of independence is an active 
area of research in the statistics and machine learning 
literature. For time series, independence is replaced 
by the asymptotic independence of events far apart 
in time, or "mixing". Mixing conditions make the dependence
of the future on the past explicit, quantifying
the decay in dependence as the future moves farther
from the past. There are many definitions of mixing
of varying strength with matching dependence coefficients
(see [8, 6, 3] for reviews), but most of the results
in the learning literature focus on $\beta$-mixing or absolute
regularity. Roughly speaking (see Definition 2.1 below
for a precise statement), the $\beta$-mixing coefficient at
lag $a$ is the total variation distance between the actual
joint distribution of events separated by $a$ time steps
and the product of their marginal distributions, i.e.,
the $L_1$ distance from independence.

Numerous results in the statistical machine learning
literature rely on knowledge of the $\beta$-mixing coefficients.
As Vidyasagar [24, p. 41] notes, $\beta$-mixing is
"just right" for the extension of IID results to dependent
data, and so recent work has consistently
focused on it. Meir [14] derives generalization error
bounds for nonparametric methods based on model selection
via structural risk minimization. Baraud et al.
[1] study the finite sample risk performance of penalized
least squares regression estimators under $\beta$-mixing.
Lozano et al. [12] examine regularized boosting
algorithms under absolute regularity and prove
consistency. Karandikar and Vidyasagar [11] consider
"probably approximately correct" learning algorithms,
proving that PAC algorithms for IID inputs remain
PAC with $\beta$-mixing inputs under some mild conditions.
Ralaivola et al. [19] derive PAC bounds for
ranking statistics and classifiers using a decomposition
of the dependency graph. Finally, Mohri and Rostamizadeh
[15] derive stability bounds for $\beta$-mixing
inputs, generalizing existing stability results for IID
data.

All these results assume not just $\beta$-mixing, but known
mixing coefficients. In particular, the risk bounds
in [14, 15] and [19] are incalculable without knowledge
of the rates. This knowledge is never available.
Unless researchers are willing to assume specific values
for a sequence of $\beta$-mixing coefficients, the results
mentioned in the previous paragraph are generally useless
when confronted with data. To illustrate this deficiency,
consider Theorem 18 of [15]:

Theorem 1.1 (Briefly). Assume a learning algorithm
is $\lambda$-stable. Then, for any sample of size $n$ drawn from
a stationary $\beta$-mixing distribution, and $\epsilon > 0$,

$$P(|R - \hat{R}| > \epsilon) \le F(n, \lambda, \epsilon, a, b) + \beta(a)(\mu_n - 1)$$

where $n = (a + b)\mu_n$, $F$ has a particular functional
form, and $R - \hat{R}$ is the difference between the true risk
and the empirical risk.

Ideally, one could use this result for model selection
or to control the size of the generalization error of
competing prediction algorithms (support vector machines,
support vector regression, and kernel ridge regression
are a few of the many algorithms known to
satisfy $\lambda$-stability). However, the bound depends explicitly
on the mixing coefficient $\beta(a)$. To make matters
worse, there are no methods for estimating the
$\beta$-mixing coefficients. According to Meir [14, p. 7],
"there is no efficient practical approach known at this
stage for estimation of mixing parameters." We begin
to rectify this problem by deriving the first method for
estimating these coefficients. We prove that our estimator
is consistent for arbitrary $\beta$-mixing processes.
In addition, we derive rates of convergence for Markov
approximations to these processes.

Application of statistical learning results to $\beta$-mixing
data is highly desirable in applied work. Many common
time series models are known to be $\beta$-mixing,
and the rates of decay are known given the true parameters
of the process. Among the processes for
which such knowledge is available are ARMA models
[16], GARCH models [4], and certain Markov processes;
see [8] for an overview of such results. To
our knowledge, only Nobel [17] approaches a solution
to the problem of estimating mixing rates by giving
a method to distinguish between different polynomial
mixing rate regimes through hypothesis testing.

We present the first method for estimating the $\beta$-mixing
coefficients for stationary time series data. Section 2
defines the $\beta$-mixing coefficient and states our
main results on convergence rates and consistency for
our estimator. Section 3 gives an intermediate result
on the $L_1$ convergence of the histogram estimator with
$\beta$-mixing inputs. Section 4 proves the main results
from §2. Section 5 concludes and lays out some avenues
for future research.

2 Estimation of $\beta$-mixing

In this section, we present one of many equivalent def- 
initions of absolute regularity and state our main re- 
sults, deferring proof to §4. 

To fix notation, let $X = \{X_t\}_{t=-\infty}^{\infty}$ be a sequence of
random variables where each $X_t$ is a measurable function
from a probability space $(\Omega, \mathcal{F}, P)$ into a measurable
space $\mathcal{X}$. A block of this random sequence will
be given by $X_i^j \equiv \{X_t\}_{t=i}^{j}$ where $i$ and $j$ are integers,
and may be infinite. We use similar notation for the
sigma fields generated by these blocks and their joint
distributions. In particular, $\sigma_i^j$ will denote the sigma
field generated by $X_i^j$, and the joint distribution of $X_i^j$
will be denoted $P_i^j$.

2.1 Definitions 

There are many equivalent definitions of $\beta$-mixing (see
for instance [8], or [3] as well as Meir [14] or Yu [27]);
however, the most intuitive is that given in Doukhan
[8].

Definition 2.1 ($\beta$-mixing). For each positive integer
$a$, the coefficient of absolute regularity, or
$\beta$-mixing coefficient, $\beta(a)$, is

$$\beta(a) \equiv \sup_t \left\| P_{-\infty}^{t} \otimes P_{t+a}^{\infty} - P_{t,a} \right\|_{TV} \tag{1}$$

where $\| \cdot \|_{TV}$ is the total variation norm, and $P_{t,a}$ is
the joint distribution of $(X_{-\infty}^{t}, X_{t+a}^{\infty})$. A stochastic
process is said to be absolutely regular, or $\beta$-mixing,
if $\beta(a) \to 0$ as $a \to \infty$.

Loosely speaking, Definition 2.1 says that the coefficient
$\beta(a)$ measures the total variation distance between
the joint distribution of random variables separated
by $a$ time units and a distribution under which
random variables separated by $a$ time units are independent.
The supremum over $t$ is unnecessary for
stationary random processes $X$, which is the only case
we consider here.
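As an illustration of Definition 2.1, the coefficient can be computed exactly in a simple case. The following is a minimal sketch, assuming a stationary first-order Markov chain on a finite state space, for which the coefficient reduces to the total variation distance between the joint law of $(X_t, X_{t+a})$ and the product of its marginals. The function names `mat_pow` and `beta_markov` and the example transition matrix are hypothetical, chosen only for this illustration.

```python
def mat_mul(A, B):
    """Multiply two square matrices stored as nested lists."""
    k = len(A)
    return [[sum(A[i][l] * B[l][j] for l in range(k)) for j in range(k)]
            for i in range(k)]


def mat_pow(P, a):
    """Return the a-step transition matrix P^a (a >= 0)."""
    k = len(P)
    out = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    for _ in range(a):
        out = mat_mul(out, P)
    return out


def beta_markov(P, pi, a):
    """Exact beta(a) for a stationary first-order Markov chain.

    P  : transition matrix (rows sum to 1)
    pi : stationary distribution of P (assumed supplied by the caller)
    The joint law of (X_t, X_{t+a}) is pi[i] * P^a[i][j]; the product of the
    marginals is pi[i] * pi[j]. Total variation is half the L1 distance.
    """
    Pa = mat_pow(P, a)
    k = len(pi)
    return 0.5 * sum(abs(pi[i] * Pa[i][j] - pi[i] * pi[j])
                     for i in range(k) for j in range(k))


# Symmetric two-state chain: stays put with probability 0.9.
P = [[0.9, 0.1], [0.1, 0.9]]
pi = [0.5, 0.5]
coefficients = [beta_markov(P, pi, a) for a in range(1, 6)]
```

For this chain the coefficients decay geometrically in the lag, which is the qualitative behavior that mixing conditions are meant to capture.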

Definition 2.2 (Stationarity). A sequence of random
variables $X$ is stationary when all its finite-dimensional
distributions are invariant over time: for
all $t$ and all non-negative integers $i$ and $j$, the random
vectors $X_t^{t+i}$ and $X_{t+j}^{t+i+j}$ have the same distribution.

Our main result requires the method of blocking used
by Yu [26, 27]. The purpose is to transform a sequence
of dependent variables into a subsequence of nearly IID
ones. Consider a sample $X_1^n$ from a stationary $\beta$-mixing
sequence with density $f$. Let $m_n$ and $\mu_n$ be
non-negative integers such that $2 m_n \mu_n = n$. Now divide
$X_1^n$ into $2\mu_n$ blocks, each of length $m_n$. Identify
the blocks as follows:

$$U_j = \{X_t : 2(j - 1)m_n + 1 \le t \le (2j - 1)m_n\},$$
$$V_j = \{X_t : (2j - 1)m_n + 1 \le t \le 2j m_n\}.$$

Let $U$ be the entire sequence of odd blocks $U_j$, and let
$V$ be the sequence of even blocks $V_j$. Finally, let $U'$
be a sequence of blocks which are independent of $X_1^n$
but such that each block has the same distribution as
a block from the original sequence:

$$U'_j \overset{d}{=} U_j \overset{d}{=} U_1. \tag{2}$$

The blocks U' are now an IID block sequence, so stan- 
dard results apply. (See [27] for a more rigorous analy- 
sis of blocking.) With this structure, we can state our 
main result. 
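The blocking construction above can be sketched in a few lines. This is an illustrative helper only: the name `make_blocks` is hypothetical, and the code uses 0-based slices, so the block with 1-based index $j$ in the text corresponds to list position $j - 1$.

```python
def make_blocks(x, m):
    """Split a sample x of length n = 2 * mu * m into odd and even blocks.

    Returns (U, V), where U holds the mu odd blocks U_1, ..., U_mu and V the
    mu even blocks V_1, ..., V_mu, each of length m, in time order.
    """
    n = len(x)
    if m <= 0 or n % (2 * m) != 0:
        raise ValueError("need a positive block length m with 2 * mu * m == n")
    mu = n // (2 * m)
    # U_j covers times 2(j-1)m + 1, ..., (2j-1)m (1-based, inclusive).
    U = [x[2 * j * m:(2 * j + 1) * m] for j in range(mu)]
    # V_j covers times (2j-1)m + 1, ..., 2jm (1-based, inclusive).
    V = [x[(2 * j + 1) * m:(2 * j + 2) * m] for j in range(mu)]
    return U, V


sample = list(range(1, 13))      # stand-in for X_1, ..., X_12
U, V = make_blocks(sample, m=2)  # mu = 3 blocks of each kind
```

Consecutive odd blocks are separated by one even block of length $m$, which is what lets the dependence between them be controlled by $\beta(m_n)$.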

2.2 Results 

Our main result emerges in two stages. First, we rec- 
ognize that the distribution of a finite sample depends 
only on finite-dimensional distributions. This leads to
an estimator of a finite-dimensional version of $\beta(a)$.
Next, we let the finite dimension increase to infinity
with the size of the observed sample.

For positive integers $t$, $d$, and $a$, define

$$\beta^d(a) = \left\| P_{t-d+1}^{t} \otimes P_{t+a}^{t+a+d-1} - P_{t,a,d} \right\|_{TV} \tag{3}$$

where $P_{t,a,d}$ is the joint distribution of
$(X_{t-d+1}^{t}, X_{t+a}^{t+a+d-1})$. Also, let $\hat{f}^d$ be the $d$-dimensional
histogram estimator of the joint density of $d$ consecutive
observations, and let $\hat{f}_a^{2d}$ be the $2d$-dimensional
histogram estimator of the joint density of two sets of
$d$ consecutive observations separated by $a$ time points.

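We construct an estimator of $\beta^d(a)$ based on these two
histograms.^1 Define

$$\hat{\beta}^d(a) = \frac{1}{2} \int \left| \hat{f}_a^{2d} - \hat{f}^d \otimes \hat{f}^d \right|. \tag{4}$$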
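A rough sketch of how an estimator of this kind might be computed is given below. It assumes the estimator is half the $L_1$ distance between the $2d$-dimensional histogram and the product of two copies of the $d$-dimensional histogram, as described in the proof of Theorem 2.4, and it assumes both histograms share a cubic grid of side `h` anchored at the origin, so the integral reduces to a sum over grid cells. The name `beta_hat` and the choice to form the marginal histogram from all windows of length $d$ are illustrative assumptions, not prescriptions from the text.

```python
import math
from collections import Counter


def beta_hat(x, a, d, h):
    """Histogram plug-in estimate of the d-dimensional beta-mixing coefficient.

    x : sequence of real observations from a single sample path
    a : lag between the two blocks (positive integer)
    d : number of consecutive observations in each block
    h : bin width of the histogram grid (assumed common to all coordinates)
    """
    n = len(x)

    def cell(block):
        # Index of the grid cell containing a d-dimensional point.
        return tuple(math.floor(v / h) for v in block)

    # Marginal histogram: relative frequencies of all length-d windows.
    n_marg = n - d + 1
    marg = Counter(cell(x[s:s + d]) for s in range(n_marg))

    # Joint histogram of (X_{t-d+1..t}, X_{t+a..t+a+d-1}).
    gap = d - 1 + a
    n_joint = n - (gap + d) + 1
    if n_joint <= 0:
        raise ValueError("sample too short for this choice of a and d")
    joint = Counter((cell(x[s:s + d]), cell(x[s + gap:s + gap + d]))
                    for s in range(n_joint))

    total = 0.0
    # Cells where the joint histogram has mass.
    for (c1, c2), count in joint.items():
        total += abs(count / n_joint - (marg[c1] / n_marg) * (marg[c2] / n_marg))
    # Cells where only the product of marginals has mass.
    for c1 in marg:
        for c2 in marg:
            if (c1, c2) not in joint:
                total += (marg[c1] / n_marg) * (marg[c2] / n_marg)
    return 0.5 * total


periodic = [0, 1] * 50                       # deterministic, far from mixing
estimate = beta_hat(periodic, a=1, d=1, h=0.5)
constant = beta_hat([3.0] * 20, a=2, d=2, h=1.0)
```

Because both histograms live on the same grid, the bin volumes cancel and the integral becomes a sum of absolute differences of cell probabilities; the double loop over marginal cells is quadratic in the number of occupied cells, which is acceptable only for small $d$.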



We show that, by allowing $d = d_n$ to grow with $n$,
this estimator will converge on $\beta(a)$. This can be seen
most clearly by bounding the $\ell_1$-risk of the estimator
with its estimation and approximation errors:

$$|\hat{\beta}^{d_n}(a) - \beta(a)| \le |\hat{\beta}^{d_n}(a) - \beta^{d_n}(a)| + |\beta^{d_n}(a) - \beta(a)|.$$

The first term is the error of estimating $\beta^d(a)$ with a
random sample of data. The second term is the non-stochastic
error induced by approximating the infinite
dimensional coefficient, $\beta(a)$, with its $d$-dimensional
counterpart, $\beta^d(a)$.

Our first theorem in this section establishes consistency
of $\hat{\beta}^{d_n}(a)$ as an estimator of $\beta(a)$ for all $\beta$-mixing
processes provided $d_n$ increases at an appropriate rate.
Theorem 2.4 gives finite sample bounds on the estimation
error while some measure theoretic arguments
contained in §4 show that the approximation error
must go to zero as $d_n \to \infty$.

Theorem 2.3. Let $X_1^n$ be a sample from an arbitrary
$\beta$-mixing process. Let $d_n = O(\exp\{W(\log n)\})$ where
$W$ is the Lambert $W$ function.^2 Then $\hat{\beta}^{d_n}(a) \xrightarrow{P} \beta(a)$
as $n \to \infty$.



A finite sample bound for the estimation error is
the first step to establishing consistency for $\hat{\beta}^{d_n}$. This
result gives convergence rates for estimation of the finite
dimensional mixing coefficient $\beta^d(a)$ and also for
Markov processes of known order $d$, since in this case,
$\beta^d(a) = \beta(a)$.

^1 While it is clearly possible to replace histograms with
other choices of density estimators (most notably KDEs),
histograms in this case are more convenient theoretically
and computationally. See §5 for more details.

^2 The Lambert $W$ function is defined as the (multivalued)
inverse of $f(w) = w \exp\{w\}$. Thus,
$O(\exp\{W(\log n)\})$ is bigger than $O(\log \log n)$ but smaller
than $O(\log n)$. See for example Corless et al. [5].

Theorem 2.4. Consider a sample $X_1^n$ from a stationary
$\beta$-mixing process. Let $\mu_n$ and $m_n$ be positive integers
such that $2\mu_n m_n = n$ and $\mu_n \ge d > 0$. Then

$$P\left(|\hat{\beta}^d(a) - \beta^d(a)| > \epsilon\right) \le 2\exp\left\{-\frac{\mu_n \epsilon_1^2}{2}\right\} + 2\exp\left\{-\frac{\mu_n \epsilon_2^2}{2}\right\} + 4(\mu_n - 1)\beta(m_n),$$

where $\epsilon_1 = \epsilon/2 - E\left[\int |\hat{f}^d - f^d|\right]$ and $\epsilon_2 = \epsilon - E\left[\int |\hat{f}_a^{2d} - f_a^{2d}|\right]$.

Consistency of the estimator $\hat{\beta}^d(a)$ is guaranteed only
for certain choices of $m_n$ and $\mu_n$. Clearly $\mu_n \to \infty$
and $\mu_n \beta(m_n) \to 0$ as $n \to \infty$ are necessary conditions.
Consistency also requires convergence of the histogram
estimators to the target densities. We leave the proof
of this theorem for Section 4. As an example to show
that this bound can go to zero with proper choices of
$m_n$ and $\mu_n$, the following corollary proves consistency
for first order Markov processes. Consistency of the
estimator for higher order Markov processes can be
proven similarly. These processes are algebraically
$\beta$-mixing as shown in e.g. Nummelin and Tuominen [18].

Corollary 2.5. Let $X_1^n$ be a sample from a first order
Markov process with $\beta(a) = \beta^1(a) = O(a^{-r})$. Then
under the conditions of Theorem 2.4, $\hat{\beta}^1(a) \xrightarrow{P} \beta(a)$.

Proof. Recall that $n = 2\mu_n m_n$. Then,

$$4(\mu_n - 1)\beta(m_n) = 4\mu_n \beta(m_n) - 4\beta(m_n) = K_1 \frac{n}{m_n} m_n^{-r} - K_2 m_n^{-r} \to 0$$

if $m_n$ grows faster than $n^{1/(1+r)}$, for constants $K_1$ and $K_2$. In this
case, we have that the exponential terms are less than



exp 



for $j = 1, 2$ and a constant $K_3$. Therefore, both exponential
terms go to 0 as $n \to \infty$. □

Proving Theorem 2.4 requires showing the $L_1$ convergence
of the histogram density estimator with $\beta$-mixing
data. We do this in the next section.

3 $L_1$ convergence of histograms

Convergence of density estimators is thoroughly studied
in the statistics and machine learning literature.
Early papers on the $L_\infty$ convergence of kernel density
estimators (KDEs) include [25, 2, 21]; Freedman and
Diaconis [9] look specifically at histogram estimators,
and Yu [26] considers the $L_\infty$ convergence of KDEs
for $\beta$-mixing data and shows that the optimal IID rates
can be attained. Devroye and Gyorfi [7] argue that $L_1$
is a more appropriate metric for studying density estimation,
and Tran [22] proves $L_1$ consistency of KDEs
under $\alpha$- and $\beta$-mixing. As far as we are aware, ours is
the first proof of $L_1$ convergence for histograms under
$\beta$-mixing.

Additionally, the dimensionality of the target density
is analogous to the order of the Markov approximation.
Therefore, the convergence rates we give are
asymptotic in the bandwidth $h_n$ which shrinks as $n$
increases, but also in the dimension $d$ which increases
with $n$. Even under these asymptotics, histogram estimation
in this sense is not a high dimensional problem.
The dimension of the target density considered here is
on the order of $\exp\{W(\log n)\}$, a rate somewhere between
$\log n$ and $\log \log n$.
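The growth rate $\exp\{W(\log n)\}$ is easy to tabulate. The sketch below computes the principal branch of the Lambert $W$ function with a plain Newton iteration, so it needs nothing beyond the standard library; the helper names `lambert_w` and `dimension_rate` are hypothetical.

```python
import math


def lambert_w(x, tol=1e-12, max_iter=100):
    """Principal branch of the Lambert W function for x >= 0.

    Solves w * exp(w) = x by Newton's method, started at log(1 + x), which
    lies at or above the root so the iterates decrease monotonically.
    """
    if x < 0:
        raise ValueError("this sketch only handles x >= 0")
    w = math.log1p(x)
    for _ in range(max_iter):
        ew = math.exp(w)
        step = (w * ew - x) / (ew * (w + 1.0))
        w -= step
        if abs(step) < tol:
            break
    return w


def dimension_rate(n):
    """Return exp(W(log n)), the order of the dimension discussed above."""
    return math.exp(lambert_w(math.log(n)))


# Compare the three growth rates at a few sample sizes.
table = [(n, math.log(math.log(n)), dimension_rate(n), math.log(n))
         for n in (10 ** 3, 10 ** 6, 10 ** 9)]
```

Since $d = \exp\{W(\log n)\}$ solves $d \log d = \log n$, even a billion observations correspond to a dimension in the single digits, which is the sense in which the problem is not high dimensional.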

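Theorem 3.1. If $\hat{f}$ is the histogram estimator based
on a (possibly vector valued) sample $X_1^n$ from a $\beta$-mixing
sequence with stationary density $f$, then for all
$\epsilon > E\left[\int |\hat{f} - f|\right]$,

$$P\left(\int |\hat{f} - f| > \epsilon\right) \le 2\exp\left\{-\frac{\mu_n \epsilon_1^2}{2}\right\} + 2(\mu_n - 1)\beta(m_n) \tag{5}$$

where $\epsilon_1 = \epsilon - E\left[\int |\hat{f} - f|\right]$.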

To prove this result, we use the blocking method of Yu
[27] to transform the dependent $\beta$-mixing sequence into a
sequence of nearly independent blocks. We then apply
McDiarmid's inequality to the blocks to derive asymptotics
in the bandwidth of the histogram as well as the
dimension of the target density. For completeness, we
state Yu's blocking result and McDiarmid's inequality
before proving the doubly asymptotic histogram convergence
for IID data. Combining these lemmas allows
us to derive rates of convergence for histograms based
on $\beta$-mixing inputs.

Lemma 3.2 (Lemma 4.1 in Yu [27]). Let $\varphi$ be a measurable
function with respect to the block sequence $U$
uniformly bounded by $M$. Then,

$$|E[\varphi] - \tilde{E}[\varphi]| \le M \beta(m_n)(\mu_n - 1), \tag{6}$$

where the first expectation is with respect to the dependent
block sequence, $U$, and $\tilde{E}$ is with respect to the
independent sequence, $U'$.

This lemma essentially gives a method of applying IID
results to $\beta$-mixing data. Because the dependence decays
as we increase the separation between blocks,
widely spaced blocks are nearly independent of each
other. In particular, the difference between expectations
over these nearly independent blocks and expectations
over blocks which are actually independent can
be controlled by the $\beta$-mixing coefficient.

Lemma 3.3 (McDiarmid Inequality [13]). Let
$X_1, \ldots, X_n$ be independent random variables, with $X_i$
taking values in a set $A_i$ for each $i$. Suppose that the
measurable function $f : \prod_i A_i \to \mathbb{R}$ satisfies

$$|f(x) - f(x')| \le c_i$$

whenever the vectors $x$ and $x'$ differ only in the $i$th
coordinate. Then for any $\epsilon > 0$,

$$P(f - Ef > \epsilon) \le \exp\left\{-\frac{2\epsilon^2}{\sum_i c_i^2}\right\}.$$
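The McDiarmid bound is simple enough to evaluate directly. The sketch below does so for arbitrary bounded-difference constants; the function name is hypothetical. The usage lines assume, for illustration, that each of $\mu$ independent blocks can change the statistic by at most $2/\mu$, in which case the bound collapses to $\exp\{-\mu\epsilon^2/2\}$.

```python
import math


def mcdiarmid_bound(eps, c):
    """Upper bound on P(f - E f > eps) from McDiarmid's inequality.

    eps : positive deviation
    c   : iterable of bounded-difference constants c_i, one per coordinate
    """
    sum_sq = sum(ci ** 2 for ci in c)
    return math.exp(-2.0 * eps ** 2 / sum_sq)


# Illustrative case: mu blocks, each with bounded difference 2 / mu.
mu = 8
eps = 0.5
bound = mcdiarmid_bound(eps, [2.0 / mu] * mu)
closed_form = math.exp(-mu * eps ** 2 / 2.0)
```

The two quantities agree because the sum of squared constants is $\mu \cdot 4/\mu^2 = 4/\mu$, so the exponent becomes $-2\epsilon^2 \mu / 4$.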



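Lemma 3.4. For an IID sample $X_1, \ldots, X_n$ from
some density $f$ on $\mathbb{R}^d$,

$$E\int |\hat{f} - E\hat{f}| \, dx = O\left(1/\sqrt{n h_n^d}\right), \tag{7}$$
$$\int |E\hat{f} - f| \, dx = O(d h_n) + O(d^2 h_n^2), \tag{8}$$

where $\hat{f}$ is the histogram estimate using a grid with
sides of length $h_n$.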

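Proof of Lemma 3.4. Let $p_j$ be the probability of
falling into the $j$th bin $B_j$. Then,

$$E\int |\hat{f} - E\hat{f}| = h_n^d \sum_{j=1}^{J} E\left| \frac{1}{n h_n^d} \sum_{i=1}^{n} I(X_i \in B_j) - \frac{p_j}{h_n^d} \right|$$
$$\le \frac{1}{n} \sum_{j=1}^{J} \sqrt{n p_j (1 - p_j)}$$
$$= O(n^{-1/2}) \, O(h_n^{-d/2}) = O\left(1/\sqrt{n h_n^d}\right).$$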



For the second claim, consider the bin $B_j$ centered at
$c$. Let $I$ be the union of all bins $B_j$. Assume the
following:

1. $f \in L_2$ and $f$ is absolutely continuous on $I$, with
a.e. partial derivatives $f_i = \frac{\partial}{\partial y_i} f(y)$

2. $f_i \in L_2$ and $f_i$ is absolutely continuous on $I$, with
a.e. partial derivatives $f_{ik} = \frac{\partial}{\partial y_k} f_i(y)$

3. $f_{ik} \in L_2$ for all $i, k$.

Using a Taylor expansion

$$f(x) = f(c) + \sum_{i=1}^{d} (x_i - c_i) f_i(c) + O(d^2 h_n^2),$$

where $f_i(y) = \frac{\partial}{\partial y_i} f(y)$. Therefore, $p_j$ is given by

$$p_j = \int_{B_j} f(x) \, dx = h_n^d f(c) + O(d^2 h_n^{d+2})$$

since the integral of the second term over the bin is
zero. This means that for the $j$th bin,

$$E\hat{f}_n(x) - f(x) = \frac{p_j}{h_n^d} - f(x) = -\sum_{i=1}^{d} (x_i - c_i) f_i(c) + O(d^2 h_n^2).$$

Therefore,

$$\int_{B_j} |E\hat{f}_n(x) - f(x)| \, dx = \int_{B_j} \left| -\sum_{i=1}^{d} (x_i - c_i) f_i(c) + O(d^2 h_n^2) \right| dx$$
$$\le \int_{B_j} \left| \sum_{i=1}^{d} (x_i - c_i) f_i(c) \right| dx + O(d^2 h_n^{d+2})$$
$$= O(d h_n^{d+1}) + O(d^2 h_n^{d+2}).$$

Since each bin is bounded, we can sum over all $J$ bins.
The number of bins is $J = h_n^{-d}$ by definition, so

$$\int |E\hat{f}_n(x) - f(x)| \, dx = O(h_n^{-d}) \left( O(d h_n^{d+1}) + O(d^2 h_n^{d+2}) \right) = O(d h_n) + O(d^2 h_n^2).$$
□ 
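The bias rate in (8) can be checked numerically in the simplest setting. The sketch below assumes $d = 1$ and the density $f(x) = 2x$ on $[0, 1]$, a choice made purely for illustration because its second derivative vanishes, so only the $O(d h_n)$ term should survive. The function name is hypothetical, and the bin width is assumed to divide the unit interval evenly.

```python
def l1_bias_linear_density(h, sub=1000):
    """L1 distance between E[f_hat] and f for f(x) = 2x on [0, 1].

    h   : bin width, assumed to satisfy that 1 / h is an integer
    sub : even number of midpoint-rule subintervals per bin
    """
    n_bins = round(1.0 / h)
    total = 0.0
    for j in range(n_bins):
        left = j * h
        center = left + h / 2.0
        # E[f_hat] is constant on the bin and equals p_j / h = 2 * center.
        expected_height = 2.0 * center
        dx = h / sub
        for k in range(sub):
            x = left + (k + 0.5) * dx
            total += abs(expected_height - 2.0 * x) * dx
    return total


bias_coarse = l1_bias_linear_density(0.1)
bias_fine = l1_bias_linear_density(0.05)
```

Within each bin the integrand is $2|x - c|$, which integrates to $h^2/2$; summing over the $1/h$ bins gives a total bias of $h/2$, so halving the bin width halves the bias, in line with the linear term in (8).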



We can now prove the main result of this section. 



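Proof of Theorem 3.1. Let $g$ be the $L_1$ loss of the histogram
estimator, $g = \int |f - \hat{f}_n|$. Here $\hat{f}_n(x) =
\frac{1}{n h_n^d} \sum_{i=1}^{n} I(X_i \in B_j(x))$ where $B_j(x)$ is the bin containing
$x$. Let $\hat{f}_U$, $\hat{f}_V$, and $\hat{f}_{U'}$ be histograms based on
the block sequences $U$, $V$, and $U'$ respectively. Clearly
$\hat{f}_n = \frac{1}{2}(\hat{f}_U + \hat{f}_V)$. Now,

$$P(g > \epsilon) = P\left(\int |f - \hat{f}_n| > \epsilon\right)$$
$$= P\left(\int \left| \frac{f - \hat{f}_U}{2} + \frac{f - \hat{f}_V}{2} \right| > \epsilon\right)$$
$$\le P\left(\frac{1}{2}\int |f - \hat{f}_U| + \frac{1}{2}\int |f - \hat{f}_V| > \epsilon\right)$$
$$= P(g_U + g_V > 2\epsilon)$$
$$\le P(g_U > \epsilon) + P(g_V > \epsilon)$$
$$= 2P(g_U - E[g_U] > \epsilon - E[g_U])$$
$$= 2P(g_U - E[g_{U'}] > \epsilon - E[g_{U'}])$$
$$= 2P(g_U - E[g_{U'}] > \epsilon_1),$$

where $\epsilon_1 = \epsilon - E[g_{U'}]$. Here,

$$E[g_{U'}] \le E\int |\hat{f}_{U'} - E\hat{f}_{U'}| \, dx + \int |E\hat{f}_{U'} - f| \, dx,$$

so by Lemma 3.4, as long as $\mu_n \to \infty$, $h_n \downarrow 0$, and
$\mu_n h_n^d \to \infty$, then for all $\epsilon$ there exists $n_0(\epsilon)$ such that
for all $n > n_0(\epsilon)$, $\epsilon > E[g_U] = E[g_{U'}]$. Now applying
Lemma 3.2 to the expectation of the indicator of the
event $\{g_U - E[g_{U'}] > \epsilon_1\}$ gives

$$2P(g_U - E[g_{U'}] > \epsilon_1) \le 2P(g_{U'} - E[g_{U'}] > \epsilon_1) + 2(\mu_n - 1)\beta(m_n)$$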

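where the probability on the right is for the $\sigma$-field generated
by the independent block sequence $U'$. Since
these blocks are independent, showing that $g_{U'}$ satisfies
the bounded differences requirement allows for
the application of McDiarmid's inequality (Lemma 3.3) to the
blocks. For any two block sequences $u'_1, \ldots, u'_{\mu_n}$ and
$\bar{u}'_1, \ldots, \bar{u}'_{\mu_n}$ with $u'_\ell = \bar{u}'_\ell$ for all $\ell \ne j$,

$$\left| g_{U'}(u'_1, \ldots, u'_{\mu_n}) - g_{U'}(\bar{u}'_1, \ldots, \bar{u}'_{\mu_n}) \right|$$
$$= \left| \int |\hat{f}(y; u'_1, \ldots, u'_{\mu_n}) - f(y)| \, dy - \int |\hat{f}(y; \bar{u}'_1, \ldots, \bar{u}'_{\mu_n}) - f(y)| \, dy \right|$$
$$\le \int |\hat{f}(y; u'_1, \ldots, u'_{\mu_n}) - \hat{f}(y; \bar{u}'_1, \ldots, \bar{u}'_{\mu_n})| \, dy$$
$$\le \frac{2}{\mu_n}.$$

Therefore,

$$P(g > \epsilon) \le 2P(g_{U'} - E[g_{U'}] > \epsilon_1) + 2(\mu_n - 1)\beta(m_n)$$
$$\le 2\exp\left\{-\frac{\mu_n \epsilon_1^2}{2}\right\} + 2(\mu_n - 1)\beta(m_n).$$

□

4 Proofs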

The proof of Theorem 2.4 relies on the triangle inequality
and the relationship between total variation
distance and the $L_1$ distance between densities.

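Proof of Theorem 2.4. For any probability measures $\nu$
and $\lambda$ defined on the same probability space with associated
densities $f_\nu$ and $f_\lambda$ with respect to some dominating
measure $\pi$,

$$\|\nu - \lambda\|_{TV} = \frac{1}{2}\int |f_\nu - f_\lambda| \, d\pi.$$

Let $P$ be the $d$-dimensional stationary distribution of
the $d$th order Markov process, i.e. $P = P_{t-d+1}^{t} = P_{t+a}^{t+a+d-1}$
in the notation of equation 3. Let $P_{a,d}$ be
the joint distribution of the bivariate random process
created by the initial process and itself separated by $a$
time steps. By the triangle inequality, we can upper
bound $\beta^d(a)$ for any $d = d_n$. Let $\hat{P}$ and $\hat{P}_{a,d}$ be the
distributions associated with histogram estimators $\hat{f}^d$
and $\hat{f}_a^{2d}$ respectively. Then,

$$\beta^d(a) = \|P \otimes P - P_{a,d}\|_{TV}$$
$$= \|P \otimes P - \hat{P} \otimes \hat{P} + \hat{P} \otimes \hat{P} - \hat{P}_{a,d} + \hat{P}_{a,d} - P_{a,d}\|_{TV}$$
$$\le \|P \otimes P - \hat{P} \otimes \hat{P}\|_{TV} + \|\hat{P} \otimes \hat{P} - \hat{P}_{a,d}\|_{TV} + \|\hat{P}_{a,d} - P_{a,d}\|_{TV}$$
$$\le 2\|P - \hat{P}\|_{TV} + \|\hat{P} \otimes \hat{P} - \hat{P}_{a,d}\|_{TV} + \|\hat{P}_{a,d} - P_{a,d}\|_{TV}$$
$$= \int |f^d - \hat{f}^d| + \frac{1}{2}\int |\hat{f}^d \otimes \hat{f}^d - \hat{f}_a^{2d}| + \frac{1}{2}\int |\hat{f}_a^{2d} - f_a^{2d}|$$

where $\frac{1}{2}\int |\hat{f}^d \otimes \hat{f}^d - \hat{f}_a^{2d}|$ is our estimator $\hat{\beta}^d(a)$ and the
remaining terms are the $L_1$ distance between a density
estimator and the target density. Thus,

$$\beta^d(a) - \hat{\beta}^d(a) \le \int |f^d - \hat{f}^d| + \frac{1}{2}\int |\hat{f}_a^{2d} - f_a^{2d}|.$$

A similar argument starting from $\hat{\beta}^d(a) =
\|\hat{P} \otimes \hat{P} - \hat{P}_{a,d}\|_{TV}$ shows that

$$\hat{\beta}^d(a) - \beta^d(a) \le \int |f^d - \hat{f}^d| + \frac{1}{2}\int |\hat{f}_a^{2d} - f_a^{2d}|,$$

so we have that

$$|\beta^d(a) - \hat{\beta}^d(a)| \le \int |f^d - \hat{f}^d| + \frac{1}{2}\int |\hat{f}_a^{2d} - f_a^{2d}|.$$

Therefore,

$$P\left(|\beta^d(a) - \hat{\beta}^d(a)| > \epsilon\right) \le P\left(\int |f^d - \hat{f}^d| + \frac{1}{2}\int |\hat{f}_a^{2d} - f_a^{2d}| > \epsilon\right)$$
$$\le P\left(\int |f^d - \hat{f}^d| > \frac{\epsilon}{2}\right) + P\left(\frac{1}{2}\int |\hat{f}_a^{2d} - f_a^{2d}| > \frac{\epsilon}{2}\right)$$
$$\le 2\exp\left\{-\frac{\mu_n \epsilon_1^2}{2}\right\} + 2\exp\left\{-\frac{\mu_n \epsilon_2^2}{2}\right\} + 4(\mu_n - 1)\beta(m_n),$$

where $\epsilon_1 = \epsilon/2 - E\left[\int |\hat{f}^d - f^d|\right]$ and $\epsilon_2 = \epsilon - E\left[\int |\hat{f}_a^{2d} - f_a^{2d}|\right]$.

□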



The proof of Theorem 2.3 requires two steps which are
given in the following lemmas. The first specifies the
histogram bandwidth $h_n$ and the rate at which $d_n$ (the
dimensionality of the target density) goes to infinity. If
the dimensionality of the target density were fixed, we
could achieve rates of convergence similar to those for
histograms based on IID inputs. However, we wish to
allow the dimensionality to grow with $n$, so the rates
are much slower as shown in the following lemma.

Lemma 4.1. For the histogram estimator in
Lemma 3.4, let

$$d_n \sim \exp\{W(\log n)\},$$
$$h_n \sim n^{-k_n},$$

with

$$k_n = \frac{W(\log n) + \frac{1}{2}\log n}{\log n \left(\frac{1}{2}\exp\{W(\log n)\} + 1\right)}.$$

These choices lead to the optimal rate of convergence.

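Proof. Let $h_n = n^{-k_n}$ for some $k_n$ to be determined.
Then we want $n^{-1/2} h_n^{-d_n/2} = n^{(k_n d_n - 1)/2} \to 0$,
$d_n h_n = d_n n^{-k_n} \to 0$, and $d_n^2 h_n^2 = d_n^2 n^{-2k_n} \to 0$, all
as $n \to \infty$. Call these $A$, $B$, and $C$. Taking $A$ and $B$
first gives

$$n^{(k_n d_n - 1)/2} \sim d_n n^{-k_n}$$
$$\Rightarrow \frac{1}{2}(k_n d_n - 1)\log n \sim \log d_n - k_n \log n$$
$$\Rightarrow k_n \log n \left(\frac{1}{2} d_n + 1\right) \sim \log d_n + \frac{1}{2}\log n$$
$$\Rightarrow k_n \sim \frac{\log d_n + \frac{1}{2}\log n}{\log n \left(\frac{1}{2} d_n + 1\right)}. \tag{9}$$

Similarly, combining $A$ and $C$ gives

$$k_n \sim \frac{2\log d_n + \frac{1}{2}\log n}{\log n \left(\frac{1}{2} d_n + 2\right)}. \tag{10}$$

Equating (9) and (10) and solving for $d_n$ gives

$$\Rightarrow d_n \sim \exp\{W(\log n)\}$$

where $W(\cdot)$ is the Lambert $W$ function. Plugging back
into (9) gives that

$$h_n = n^{-k_n}$$

where

$$k_n = \frac{W(\log n) + \frac{1}{2}\log n}{\log n \left(\frac{1}{2}\exp\{W(\log n)\} + 1\right)}.$$

□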

It is also necessary to show that as $d$ grows, $\beta^d(a) \to
\beta(a)$. We now prove this result.

Lemma 4.2. $\beta^d(a)$ converges to $\beta(a)$ as $d \to \infty$.

Proof. By stationarity, the supremum over $t$ is unnecessary
in Definition 2.1, so without loss of generality,
let $t = 0$. Let $P_{-\infty}^{0}$ be the distribution on
$\sigma_{-\infty}^{0} = \sigma(\ldots, X_{-1}, X_0)$, and let $P_{a+1}^{\infty}$ be the distribution
on $\sigma_{a+1}^{\infty} = \sigma(X_{a+1}, X_{a+2}, \ldots)$. Let $P_a$ be the
distribution on $\sigma = \sigma_{-\infty}^{0} \otimes \sigma_{a+1}^{\infty}$ (the product sigma-field).
Then we can rewrite Definition 2.1 using this
notation as

$$\beta(a) = \sup_{C \in \sigma} \left| P_a(C) - [P_{-\infty}^{0} \otimes P_{a+1}^{\infty}](C) \right|.$$

Let $\sigma_{-d+1}^{0}$ and $\sigma_{a+1}^{a+d}$ be the sub-$\sigma$-fields of $\sigma_{-\infty}^{0}$ and
$\sigma_{a+1}^{\infty}$ consisting of the $d$-dimensional cylinder sets for
the $d$ dimensions closest together. Let $\sigma^d$ be the product
$\sigma$-field of these two. Then we can rewrite $\beta^d(a)$
as

$$\beta^d(a) = \sup_{C \in \sigma^d} \left| P_a(C) - [P_{-\infty}^{0} \otimes P_{a+1}^{\infty}](C) \right|. \tag{11}$$

As such $\beta^d(a) \le \beta(a)$ for all $a$ and $d$. We can
rewrite (11) in terms of finite-dimensional marginals:

$$\beta^d(a) = \sup_{C \in \sigma^d} \left| P_{a,d}(C) - [P_{-d+1}^{0} \otimes P_{a+1}^{a+d}](C) \right|,$$

where $P_{a,d}$ is the restriction of $P$ to
$\sigma(X_{-d+1}, \ldots, X_0, X_{a+1}, \ldots, X_{a+d})$. Because of the
nested nature of these sigma-fields, we have

$$\beta^{d_1}(a) \le \beta^{d_2}(a) \le \beta(a)$$

for all finite $d_1 \le d_2$. Therefore, for fixed $a$, $\{\beta^d(a)\}_{d=1}^{\infty}$
is a monotone increasing sequence which is bounded
above, and it converges to some limit $L \le \beta(a)$. To
show that $L = \beta(a)$ requires some additional steps.

Let $R = P_a - [P_{-\infty}^{0} \otimes P_{a+1}^{\infty}]$, which is a signed measure
on $\sigma$. Let $R^d = P_{a,d} - [P_{-d+1}^{0} \otimes P_{a+1}^{a+d}]$, which is
a signed measure on $\sigma^d$. Decompose $R$ into positive
and negative parts as $R = Q^+ - Q^-$ and similarly for
$R^d = Q^{+d} - Q^{-d}$. Notice that since $R^d$ is constructed
using the marginals of $P$, then $R(E) = R^d(E)$ for all
$E \in \sigma^d$. Now since $R$ is the difference of probability
measures, we must have that

$$0 = R(\Omega) = Q^+(\Omega) - Q^-(\Omega)$$
$$= Q^+(D) + Q^+(D^c) - Q^-(D) - Q^-(D^c) \tag{12}$$

for all $D \in \sigma$.

Define $Q = Q^+ + Q^-$. Let $\epsilon > 0$. Let $C \in \sigma$ be such
that

$$Q(C) = \beta(a) = Q^+(C) = Q^-(C^c). \tag{13}$$

Such a set $C$ is guaranteed by the Hahn decomposition
theorem (letting $C^*$ be a set which attains the
supremum in (11), we can throw away any subsets
with negative $R$ measure) and (12), assuming without
loss of generality that $P_a(C) > [P_{-\infty}^{0} \otimes P_{a+1}^{\infty}](C)$. We
can use the field $\sigma_f = \bigcup_d \sigma^d$ to approximate $\sigma$ in the
sense that, for all $\epsilon$, we can find $A \in \sigma_f$ such that
$Q(A \Delta C) < \epsilon/2$ (see Theorem D in Halmos [10, §13]
or Lemma A.24 in Schervish [20]). Now,

$$Q(A \Delta C) = Q(A \cap C^c) + Q(C \cap A^c)$$
$$= Q^-(A \cap C^c) + Q^+(C \cap A^c)$$

by (13) since $A \cap C^c \subseteq C^c$ and $C \cap A^c \subseteq C$. Therefore,
since $Q(A \Delta C) < \epsilon/2$, we have

$$Q^-(A \cap C^c) \le \epsilon/2, \qquad Q^+(A^c \cap C) \le \epsilon/2. \tag{14}$$

Also,

$$Q(C) = Q(A \cap C) + Q(A^c \cap C)$$
$$= Q^+(A \cap C) + Q^+(A^c \cap C)$$
$$\le Q^+(A) + \epsilon/2$$

since $A \cap C$ and $A^c \cap C$ are contained in $C$ and $A \cap C \subseteq
A$. Therefore

$$Q^+(A) \ge Q(C) - \epsilon/2.$$

Similarly,

$$Q^-(A) = Q^-(A \cap C) + Q^-(A \cap C^c) \le 0 + \epsilon/2 = \epsilon/2$$

since $A \cap C \subseteq C$ with $Q^-(C) = 0$, and by (14). Finally,

$$Q^{+d}(A) \ge Q^{+d}(A) - Q^{-d}(A) = R^d(A)$$
$$= R(A) = Q^+(A) - Q^-(A)$$
$$\ge Q(C) - \epsilon/2 - \epsilon/2 = Q(C) - \epsilon$$
$$= \beta(a) - \epsilon.$$

And since $\beta^d(a) \ge Q^{+d}(A)$, we have that for all $\epsilon > 0$
there exists $d$ such that for all $d_1 > d$,

$$\beta^{d_1}(a) \ge \beta^d(a) \ge Q^{+d}(A) \ge \beta(a) - \epsilon.$$

Thus, we must have that $L = \beta(a)$, so that $\beta^d(a) \to
\beta(a)$ as desired. □

Proof of Theorem 2.3. By the triangle inequality,

$$|\hat{\beta}^{d_n}(a) - \beta(a)| \le |\hat{\beta}^{d_n}(a) - \beta^{d_n}(a)| + |\beta^{d_n}(a) - \beta(a)|.$$

The first term on the right is bounded by the result
in Theorem 2.4, where we have shown that $d_n =
O(\exp\{W(\log n)\})$ is slow enough for the histogram
estimator to remain consistent. That $\beta^{d_n}(a) \to
\beta(a)$ as $n \to \infty$ follows from Lemma 4.2. □

5 Discussion 

We have shown that our estimator of the $\beta$-mixing
coefficients is consistent for the true coefficients $\beta(a)$
under some conditions on the data generating process.
There are numerous results in the statistics and machine
learning literatures which assume knowledge of
the $\beta$-mixing coefficients, yet as far as we know, this
is the first estimator for them. An ability to estimate
these coefficients will allow researchers to apply ex- 
isting results to dependent data without the need to 
arbitrarily assume their values. Despite the obvious 
utility of this estimator, as a consequence of its novelty, 
it comes with a number of potential extensions which 
warrant careful exploration as well as some drawbacks. 

The reader will note that Theorem 2.3 does not provide
a convergence rate. The rate in Theorem 2.4 applies
only to the difference between $\hat{\beta}^d(a)$ and $\beta^d(a)$.
In order to provide a rate in Theorem 2.3, we would
need a better understanding of the non-stochastic convergence
of $\beta^d(a)$ to $\beta(a)$. It is not immediately
clear that this quantity can converge at any well-defined
rate. In particular, it seems likely that the
rate of convergence depends on the tail of the sequence
$\{\beta(a)\}_{a=1}^{\infty}$.

Several other mixing and weak-dependence coefficients
also have a total-variation flavor, perhaps most notably
$\alpha$-mixing [8, 6, 3]. None of them have estimators,
and the same trick might well work for them, too.

The use of histograms rather than kernel density estimators
for the joint and marginal densities is somewhat
surprising and not entirely necessary. As mentioned
above, Tran [22] proved that KDEs are consistent
for estimating the stationary density of a time
series with $\beta$-mixing inputs, so one could just replace
the histograms in our estimator with KDEs. However,
KDEs suffer from two major issues. Theoretically,
we need an analogue of the double asymptotic results
proven for histograms in Lemma 3.4. In particular,
we need to estimate increasingly higher dimensional
densities as $n \to \infty$. This does not cause a problem
of small-$n$-large-$d$ since $d$ is chosen as a function of $n$;
however, it will lead to increasingly higher dimensional
integration. For histograms, the integral is always trivial,
but in the case of KDEs, the numerical accuracy
of the integration algorithm becomes increasingly important.
This issue could swamp any efficiency gains
obtained through the use of kernels. However, this
question certainly warrants further investigation.

The main drawback of an estimator based on a den- 
sity estimate is its complexity. The mixing coefficients 
are functionals of the joint and marginal distributions 
derived from the stochastic process X, however, it is 
unsatisfying to estimate densities and solve integrals in 
order to estimate a single number. Vapnik's main prin- 
ciple for solving problems using a restricted amount of 
information is 

When solving a given problem, try to avoid 
solving a more general problem as an inter- 
mediate step [23, p. 30]. 

This principle is clearly violated here, but perhaps our
seed will precipitate a more aesthetically pleasing solution.